A Really Simple Approximation of Smallest Grammar
Author
Abstract
In this paper we present a really simple linear-time algorithm that constructs a context-free grammar of size O(g log(N/g)) for the input string, where N is the size of the input string and g is the size of the optimal grammar generating this string. The algorithm works for alphabets of arbitrary size, but the running time is linear only under the assumption that the alphabet Σ of the input string can be identified with numbers from {1, ..., N^c} for some constant c. Algorithms with such an approximation guarantee and running time are known, but all of them are non-trivial and their analyses are involved. The algorithm presented here computes the LZ77 factorisation and transforms it in phases into a grammar. In each phase it maintains an LZ77-like factorisation of the word with at most ℓ factors as well as O(ℓ) additional letters, where ℓ is the size of the original LZ77 factorisation. In one phase we greedily (by a left-to-right sweep, with the help of the factorisation) choose a set of pairs of consecutive letters to be replaced with new symbols, i.e. nonterminals of the constructed grammar. We choose at least 2/3 of the letters in the word, and there are O(ℓ) different pairs among them. Hence there are O(log N) phases, each of which introduces O(ℓ) nonterminals to the grammar. A more precise analysis yields a bound of O(ℓ log(N/ℓ)). As ℓ ≤ g, this yields the desired bound O(g log(N/g)).
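The starting point of the algorithm, the LZ77 factorisation, can be illustrated with a minimal Python sketch. This is not the paper's procedure: it is a naive greedy factoriser that runs in quadratic time or worse, whereas the abstract relies on a linear-time computation. The function names `lz77_factorise` and `decode` are hypothetical, and the sketch assumes the self-referential variant in which a factor may overlap its own source.

```python
def lz77_factorise(w):
    """Naive greedy LZ77-style factorisation (illustrative, not linear-time).

    Each factor is either a single fresh letter (a str) or a pair
    (source_position, length) that copies an earlier occurrence.
    Assumption: a copy may overlap the position being written.
    """
    factors = []
    i, n = 0, len(w)
    while i < n:
        best_len, best_src = 0, -1
        # Try every earlier start position j and extend the match greedily.
        for j in range(i):
            k = 0
            while i + k < n and w[j + k] == w[i + k]:
                k += 1
            if k > best_len:
                best_len, best_src = k, j
        if best_len == 0:
            factors.append(w[i])  # letter not seen before
            i += 1
        else:
            factors.append((best_src, best_len))
            i += best_len
    return factors


def decode(factors):
    """Rebuild the word from its factors, copying letter by letter
    so that self-overlapping factors are handled correctly."""
    out = []
    for f in factors:
        if isinstance(f, str):
            out.append(f)
        else:
            src, length = f
            for k in range(length):
                out.append(out[src + k])
    return "".join(out)


if __name__ == "__main__":
    word = "abababab"
    fs = lz77_factorise(word)
    print(fs)          # list of factors; its length plays the role of ℓ
    print(decode(fs))  # reconstructs the input word
```

The number of factors returned corresponds to ℓ in the abstract; the phases that turn the factorisation into a grammar are not shown here.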
Related resources
Approximation algorithms for grammar-based data compression
This thesis considers the smallest grammar problem: find the smallest context-free grammar that generates exactly one given string. We show that this problem is intractable, and so our objective is to find approximation algorithms. This simple question is connected to many areas of research. Most importantly, there is a link to data compression; instead of storing a long string, one can store a...
The Smallest Grammar Problem Revisited
In a seminal paper of Charikar et al. on the smallest grammar problem, the authors derive upper and lower bounds on the approximation ratios for several grammar-based compressors, but in all cases there is a gap between the lower and upper bound. Here we close the gaps for LZ78 and BISECTION by showing that the approximation ratio of LZ78 is Θ((n/log n)^{2/3}), whereas the approximation ratio of BISE...
The Generalized Smallest Grammar Problem
The Smallest Grammar Problem – the problem of finding the smallest context-free grammar that generates exactly one given sequence – has never been successfully applied to grammatical inference. We investigate the reasons and propose an extended formulation that seeks to minimize non-recursive grammars, instead of straight-line programs. In addition, we provide very efficient algorithms that app...
Approximation of smallest linear tree grammar
A simple linear-time algorithm for constructing a linear context-free tree grammar of size O(rg + rg log(n/rg)) for a given input tree T of size n is presented, where g is the size of a minimal linear context-free tree grammar for T, and r is the maximal rank of symbols in T (which is a constant in many applications). This is the first example of a grammar-based tree compression algorithm with...
Journal title:
- Theor. Comput. Sci.
Volume 616, Issue
Pages -
Publication date 2014